Skip to content

Phase 6.4: epoll-driven RX dispatch with adaptive timeout (3.5× throughput vs main)#69

Merged
dpsoft merged 122 commits intomainfrom
smoltcp-passt-port-phase6.4-epoll
May 6, 2026
Merged

Phase 6.4: epoll-driven RX dispatch with adaptive timeout (3.5× throughput vs main)#69
dpsoft merged 122 commits intomainfrom
smoltcp-passt-port-phase6.4-epoll

Conversation

@dpsoft
Copy link
Copy Markdown
Contributor

@dpsoft dpsoft commented May 3, 2026

Status: DRAFT. Stacked on top of PR #68 (`smoltcp-passt-port-phase0` — Phases 0–5 + 5.5b). Implements Phase 6.4 of `docs/superpowers/plans/2026-04-30-smoltcp-passt-port.md` per the detailed TDD plan in `docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.4.md`.

What this branch does

Replaces `net_poll_thread`'s timer-driven `sleep(5 ms)` cycle with an event-driven `epoll_wait` dispatch using an adaptive timeout (5 ms when active, 50 ms when idle). Host TCP/UDP/ICMP socket FDs are registered with a shared `EpollDispatch` on every `flow_table` insert; `net_poll_thread` blocks on `epoll_wait` and feeds ready events to `SlirpBackend::drain_to_guest` via a small per-backend event queue. A self-pipe `Waker` lets `process_guest_frame` (vCPU thread) wake the poll thread immediately when it queues frames.

Headline perf vs `origin/main` (`voidbox-network-bench --iterations 3 --bulk-mb 10`)

Metric Master This branch Δ
TCP g2h throughput 1885 Mbps 6680 Mbps +254% (3.5×)
TCP bulk-g2h (SO_RCVBUF=4096) 1565 Mbps 5650 Mbps +261% (3.6×)
TCP CRR p50 ~10 ms ~10 ms parity
TCP RR p50 2 µs 1-3 µs parity (within noise)

Divan microbench wins (`scripts/bench-compare.sh --baseline 47868f0 --skip-vm`)

The big-N case is the design's headline win — only ready flows get visited, no full-table scan:

Bench Baseline This branch Δ%
`poll_with_n_mixed_flows/999` 304 µs 8.5 µs -97%
`poll_with_n_mixed_flows/99` 29.3 µs 3.81 µs -87%
`poll_with_n_mixed_flows/3` 1.11 µs 651 ns -41%
`flow_table_insert_remove/1000` 23.3 µs 24.4 µs +5%
`tcp_bulk_throughput_1mb` 59.1 ms 59.8 ms +1% (parity)

Adaptive timeout: how it preserves correctness

The naive `epoll_wait(5 ms fixed)` recovered CRR but burned 200 wakes/sec when idle. The naive `epoll_wait(50 ms fixed)` recovered idle CPU but caused +40 ms CRR latency (Linux delayed-ACK timer).

Adaptive: last cycle had any kernel event ⇒ next timeout = 5 ms (active); last cycle pure timeout ⇒ next timeout = 50 ms (idle). One quiet cycle drops to idle, one event back to active.

The subtle part: when `epoll_waker.wake()` fires during a 50 ms idle wait, the kernel's `epoll_wait` returns with the self-pipe event. `wait_with_timeout` filters that event out — so a naive "is_empty ⇒ idle" predicate would have stayed at 50 ms, regressing CRR. `wait_with_timeout` now returns the raw kernel count (including filtered self-pipe wakes); adaptive treats wakes as activity. Filtered events still arrive in the out parameter unchanged.

Inherent microbench overhead (test/bench fallback path only)

Microbenches without a `net_poll_thread` fall back to `drain_to_guest`'s self-poll path (one mutex op + one non-blocking `epoll_wait` syscall per call, ~310 ns). This does NOT apply to production, where `has_external_poller=true` after the first `push_ready_events` skips the syscall (~50 ns total).

Bench (test path) Baseline This branch Note
`poll_idle` 160 ns 441 ns inherent fallback cost
`poll_with_n_flows/*` 159-180 ns 440-441 ns inherent fallback cost
`dns_cache_hit` 190 ns 1.03 µs inherent fallback cost
`process_arp_request` 4.91 µs 18.1 µs EpollDispatch::new() per iteration in benches that build fresh stacks (~10 µs setup amortized in production once per VM)

Iteration history

The first cut of Phase 6.4 caused a 40× throughput regression in production (1885 → 44 Mbps). Six fix passes with eBPF profiling (perf-agent) restored it:

  1. `ed048e5` — eliminate epoll mutex contention. `net_poll_thread` held the EpollDispatch mutex for the full 50 ms of its blocking wait; `drain_to_guest` contended on the same mutex. Per-backend `pending_events` queue replaces the contention.
  2. `17b437b` — replace two-pass relay sweep with lazy close queue. The Pass-1 `flow_table.keys().copied().collect()` was 47 cache misses/1K instructions — now FIN/RST handlers push to `pending_close` directly.
  3. `bdef4bd` — drain `pending_events` first, fall back to non-blocking poll only if empty. The original try_lock-vs-pending_events branch dropped events the net-poll thread had already pushed.
  4. `15231cb` — wake `net_poll_thread` when `process_guest_frame` queues frames. Pre-Phase-6.4 the 5 ms sleep masked the absence of an explicit wake; with `epoll_wait` waiting on FD readiness, queued ACKs sat 50 ms before flushing because the guest is the writer (no FD-side signal). This was the throughput unblocker (44 → 6000 Mbps).
  5. `bebeb30` — `epoll_wait` timeout 50 ms → 5 ms. CRR p50 was stuck at 51 ms (one full epoll cycle) because the Linux guest relies on regular IRQ pulses to advance its TCP delayed-ACK timer. 5 ms cadence mirrors pre-Phase-6.4; fast paths still wake immediately via epoll readiness. Restored CRR to 10 ms baseline.
  6. `1d3e816` — added `has_external_poller` flag (skips the fallback syscall when `push_ready_events` is feeding events) and stripped `Phase-N` references from inline comments per AGENTS.md style guide.
  7. `9a46865` — adaptive 5 ms / 50 ms timeout. Recovers Phase 6.4's headline 10× idle-wakeup reduction while preserving CRR parity. Required exposing raw kernel event count from `wait_with_timeout` so adaptive treats self-pipe wakes as activity.

Profile evidence (post-fix, 60 s capture during voidbox-network-bench)

  • IPC: 0.544 (was 0.413 pre-fix; threshold for memory-bound is <0.8 — improved, still bound).
  • Cache misses: 38.1 / 1K (was 47.4; threshold 10 — improved, still high — would need data-structure work to push lower).
  • P99 on-CPU: 0.059 ms (was 50.7 ms pre-fix — the lock-contention tail is gone).
  • `Mutex::lock_contended` on-CPU: 1.34 % (was the dominant cost pre-fix).
  • `net_poll_thread` on-CPU at fixed 5 ms cadence: 4.93 % of total CPU. Adaptive brings this to ~10× lower during idle stretches — the headline 10× idle-wakeup reduction is recovered.

Validation

Suite Status
`cargo fmt --all -- --check`
`cargo clippy --workspace --all-targets --all-features -- -D warnings`
`cargo test --test network_baseline` ✅ 18/18 (default), 19/19 with `--features bench-helpers -- --test-threads=1`
`cargo test --lib network::epoll_dispatch` ✅ 5/5
`cargo bench --bench network --features bench-helpers --no-run`
`cargo build --release`
`voidbox-network-bench --iterations 3 --bulk-mb 10` ✅ 3.5× throughput vs master, RR/CRR parity
eBPF profile (`~/.local/bin/perf-agent`) ✅ post-fix off-CPU `drain_to_guest` near zero (was 9% pre-fix)

Open follow-ups (deferred to future PRs, NOT blocking)

  • Phase 6.1 (TCP half-close), Phase 6.2 (async connect), Phase 6.3 (window mgmt) — separate plans, all deferred per overview at `smoltcp-passt-port-phase6.md`.
  • `port_forward_accept_latency` stays at 50 ms (listener thread polls every 50 ms; un-touched by Phase 6.4). Moving the listener accept onto the same epoll set would drop this to microseconds — small follow-up, not in 6.4 scope.
  • Cache-miss rate 38 / 1K would benefit from a faster HashMap hasher or contiguous `FlowEntry` storage. Worthwhile but invasive; deserves its own focused work.

Test plan for review

  • `cargo test --test network_baseline` passes 18/18.
  • `cargo test --test network_baseline --features bench-helpers -- --test-threads=1` passes 19/19 (the snapshot-rebuild smoke pin).
  • `cargo bench --bench network --features bench-helpers tcp_rx_latency_one_packet` produces a sub-millisecond median.
  • `scripts/bench-compare.sh --baseline 47868f0 --skip-vm` shows the wins above and the documented overheads.
  • `voidbox-network-bench --iterations 3 --bulk-mb 10` shows g2h ~6680 Mbps, bulk-g2h ~5650 Mbps, CRR ~10 ms.
  • Skim `docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.4.md` for the design rationale and the iteration history.

dpsoft added 30 commits April 27, 2026 17:52
Adds two planning docs under docs/superpowers/plans/:

- 2026-04-27-smoltcp-passt-port.md (spec)
  Supersedes the 2026-04-12 network-backend-abstraction design.
  Replaces "add passt as opt-in backend" with "lift passt's design
  patterns into our smoltcp stack" — keeps observability, all-Rust
  path, single binary, cross-platform parity. Lists required skills
  for execution (rust-style, rustdoc, rust-analyzer-ssr,
  superpowers TDD/verification, repo verify/profile). Maps the work
  into 5+1 phases with per-phase plan-doc placeholders.

- 2026-04-27-smoltcp-passt-port-phase0.md (Phase 0 plan)
  25 bite-sized TDD tasks: correctness baseline pins, divan
  microbenches, wall-clock e2e harness, NetworkBackend trait
  extraction, SlirpStack → SmoltcpBackend rename. Includes three
  BROKEN_ON_PURPOSE assertions that flip in later phases.
Add two baseline tests for the smoltcp DNS proxy:
- dns_query_resolves: sends a query for example.com, polls ≤20×100ms,
  asserts reply XID matches.
- dns_cache_keys_by_question_not_xid: warms cache with xid=1, then
  queries with xid=2 and asserts the stack rewrites the reply XID.

Both tests skip gracefully (eprintln + early return) when the upstream
resolver is unreachable, making them safe in offline CI.

Also adds QNAME_EXAMPLE_COM const and two module-scope helpers:
build_dns_query (builds a correct UDP DNS frame with proper payload_len)
and parse_dns_reply_xid.

SLIRP_DNS_IP added to the existing module-scope slirp import.
Implement measure_tcp_throughput_g2h: binds a host-side TCP listener,
boots a VM, execs dd|nc in the guest, drains to EOF on the host, and
computes Mbps from bytes_received / elapsed.  h2g left None with a TODO.
Implements measure_rr_latency and measure_crr_latency in
voidbox-network-bench, reusing the single shared VM booted for
throughput measurements.

RR: guest pipes N bytes over one persistent nc connection; host times
each read+write pair (first sample discarded to absorb connect jitter).
CRR: guest runs N independent nc invocations; host times each full
accept+read+write+close cycle. Both use the existing percentile()
helper (dead_code attribute removed). Latency measurements always run
regardless of --no-throughput.
Per user feedback: "Slirp" denotes the user-mode-NAT role; "smoltcp"
is the underlying library. Role-based naming keeps the public type
surface stable across library swaps and matches the symmetry of
future TapBackend / VhostNetBackend siblings.

Module file src/network/slirp.rs keeps its name (already aligned with
the new type, matches src/devices/virtio_net.rs convention).
The actual polling logic now lives in drain_to_guest, which writes
directly into the caller-supplied &mut Vec<Vec<u8>> buffer — no fresh
allocation on every tick. poll becomes a #[deprecated] shim:

    #[deprecated(note = "use drain_to_guest")]
    pub fn poll(&mut self) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        self.drain_to_guest(&mut out);
        out
    }

Existing call sites (virtio_net.rs, tests/network_baseline.rs,
benches/network.rs) are annotated with #[allow(deprecated)] and a
TODO(0D.4/0D.5) marker. They will be migrated in the next two tasks,
after which the allow attributes can be removed.
Switch VirtioNetDevice::slirp from Arc<Mutex<SlirpStack>> to
Arc<Mutex<dyn NetworkBackend>>, replacing the deprecated poll() call
in get_rx_frames with drain_to_guest into a reused rx_scratch buffer.

Update both VMM cold-boot and snapshot-restore construction sites to
coerce Arc<Mutex<SlirpStack>> to the trait object. All 14 baseline
tests pass; fmt and clippy clean.
Type rename only — the slirp.rs module file keeps its name.
SlirpBackend reflects the user-mode-NAT role rather than the
underlying smoltcp library, keeping naming symmetric with future
TapBackend / VhostNetBackend siblings.
Introduces the types and helper needed for ICMP echo NAT (Phase 1):

- IcmpEchoKey {guest_id, dst_ip}: hash key for the echo NAT table.
- IcmpEchoEntry {sock, guest_id, last_activity}: per-request state.
- open_icmp_socket(): opens SOCK_DGRAM/IPPROTO_ICMP (no CAP_NET_RAW).
- icmp_echo: HashMap<IcmpEchoKey, IcmpEchoEntry> field on SlirpBackend,
  initialized to HashMap::new() in with_security() (the canonical ctor;
  new() and Default both delegate through it).

No behavior change — handle_ipv4_frame is untouched, the map stays
empty. Dead-code allowances are scoped to the new items and will be
removed once tasks 1.2/1.3 wire them in.
dpsoft added 8 commits May 2, 2026 17:07
Divan microbench (`tcp_rx_latency_one_packet`) measures the SLIRP-layer
per-packet dispatch cost when one TCP flow is Established and the host
kernel has data ready: one zero-timeout epoll_wait + readiness scan +
peek + Ethernet frame construction.

Measured median on this host: ~9.8 µs per drain_to_guest call.

Pre-6.4 the relay iterated every flow in flow_table unconditionally
regardless of readiness.  Post-6.4 it dispatches only the flows with an
epoll EPOLLIN event, reducing wasted work on idle flows to zero.  This
bench is the regression anchor for that change.

The bench is gated on `--features bench-helpers` (like the existing
`tcp_inbound_syn_ack_transition` and `synthesize_inbound_syn` benches).
It performs a full 3-way handshake outside the timed loop so only the
hot relay path is measured.

Note: this bench cannot exercise the net_poll_thread 50 ms epoll cycle
(that thread does not run inside divan).  The wall-clock host→guest
latency floor is the province of voidbox-network-bench's
`tcp_rx_latency_us_p50` field.  That field is added to the Report struct
in this commit but returns None (deferred): wiring a guest-side listener
requires either a guest daemon or an additional exec RPC — both out of
scope for Phase 6.4.  The divan microbench is the primary numerical
deliverable for this phase.
net_poll_thread holds the EpollDispatch mutex for the full 50 ms
of its blocking wait. drain_to_guest's own non-blocking
wait_with_timeout(ZERO) call contended on the same mutex,
serializing the vCPU thread behind the net-poll thread.
voidbox-network-bench saw TCP g2h throughput drop from ~1885 Mbps
to ~44 Mbps (40× regression).

Fix: SlirpBackend gets a small Mutex<Vec<EpollEvent>> queue.
net_poll_thread pushes events into it after each successful
wait_with_timeout. drain_to_guest drains the queue (brief
uncontended lock) without touching EpollDispatch. A try_lock
fallback path serves unit tests (no net_poll_thread) without
blocking on the mutex.

NetworkBackend trait gains a push_ready_events default-no-op so
SlirpBackend can override it; VirtioNetDevice exposes
push_events_to_backend as the trampoline called by net_poll_thread.

Off-CPU profile evidence: drain_to_guest was 9% off-CPU (29.7s
in a 60s window) waiting on the epoll mutex; should drop to
near-zero post-fix.
relay_tcp_nat_data's Pass 1 unconditionally copied every TCP
FlowKey into a Vec to scan for Closed entries on every drain call.
Cache misses 47/1K under load; poll_with_n_flows/100 regressed
+246% (130ns → 450ns), /1000 regressed +220%.

Fix: when handle_tcp_frame's FIN/RST handlers and mid-function
error paths set state=Closed, push the key onto a pending_close
Vec. relay_tcp_nat_data drains this Vec at the top of its single
ready-events pass — no O(n) collect required. Idle-timeout
detection retains a direct flow_table iteration but without
allocating a separate key Vec.
…_guest

The initial Bug A fix used pending_events.lock() + try_lock(epoll)
in drain_to_guest's fast path, adding ~150ns overhead per call vs
Phase 6.4 (one extra Mutex acquire). This showed as +38% regression
in poll_idle bench (441ns → 611ns).

Revised approach: try_lock epoll first (zero cost when uncontended —
tests, benches, idle net-poll thread). On Err (net_poll_thread holds
the mutex for 50ms), drain pending_events instead. In production the
try_lock fails ~once per 50ms window; in tests it always succeeds.
Net result: drain_to_guest overhead matches Phase 6.4 when epoll is
uncontended; contention eliminated when net_poll_thread is actively
waiting.
Pre-Phase-6.4, net_poll_thread woke unconditionally every 5 ms, so
every ACK queued in inject_to_guest by handle_tcp_frame got flushed
within 5 ms. Phase 6.4's epoll_wait(50 ms) waits for FD readiness
events — but a guest writing data has no FD-side signal (the guest
is the writer; the SLIRP-side socket only becomes readable when the
host responds). So queued ACKs sat 50 ms before being flushed; TCP
send window stalled; voidbox-network-bench TCP g2h dropped from
~1885 Mbps to ~225 Mbps even after the mutex-contention fix.

Fix: track inject_to_guest length around process_guest_frame's
ethertype dispatch.  If the call queued any frames, call
epoll_waker.wake() — one byte to the non-blocking self-pipe, which
unblocks net_poll_thread's epoll_wait so the queued frames flush
within microseconds.

Also fixes the related drain_to_guest event-source ordering bug:
pending_events (filled by net_poll_thread) is now ALWAYS drained
first, with the non-blocking epoll poll only running as a fallback
when the queue is empty (test/bench paths without net_poll_thread).
The previous code took the try_lock branch when net-poll was
between iterations and silently dropped events the net-poll
thread had already pushed.

voidbox-network-bench post-fix:
  g2h:        ~6000 Mbps  (vs master 1885; +3.2x)
  bulk-g2h:   ~3900 Mbps  (vs master 1565; +2.5x at SO_RCVBUF=4096)
  rr p50:     2 us        (parity with master)
  crr p50:    ~50 ms      (5x regression vs master ~10 ms — separate
                           bug, tracked in follow-up; the 50 ms is
                           exactly one epoll_wait cycle and points
                           to a connection-establishment latency
                           issue independent of the throughput path)
CRR p50 was regressing +40 ms (10 ms → 51 ms) post-Phase-6.4.  The
+40 ms exactly matches Linux's TCP delayed-ACK timer, and the cause
is that Phase 6.4 widened the net-poll IRQ re-pulse cadence from
5 ms to 50 ms.

The Linux guest spends most idle time in HLT and relies on regular
vCPU scheduling slots — driven by our IRQ pulses — to advance its
TCP delayed-ACK timer.  At 50 ms cadence the guest's pure ACKs
ride the next event-triggered IRQ, which can be 40+ ms away.  At
5 ms the housekeeping cadence mirrors pre-6.4 and the timer fires
on schedule.

We lose Phase 6.4's headline "10x idle-wakeup reduction" goal but
fast-path events still wake immediately via epoll readiness — so
the net win vs master is unchanged: g2h throughput +250%, bulk
throughput +250%, RR parity, CRR parity.

voidbox-network-bench post-fix:
  g2h:        ~6500 Mbps  (vs master 1885; +247%)
  bulk-g2h:   ~5400 Mbps  (vs master 1565; +245%)
  rr p50:     ~3 us       (parity)
  crr p50:    ~10100 us   (parity — back to baseline 10 ms)
Per AGENTS.md doc-comment style ("avoid ticket IDs and PR/commit
references inside doc comments and inline comments — they belong
in commit messages and PR descriptions where they're audit trail;
in code they age into noise as the ticketing context evolves").

Phase references fall into the same category.  Comments are
rewritten in present tense to explain the structural reasoning
without referencing when each piece landed.  Identifiers like
test names and BROKEN_ON_PURPOSE markers are unchanged.

Plan/spec docs in docs/superpowers/plans/ are intentionally
untouched — phase references there ARE the audit trail.
Recovers Phase 6.4's headline 10x idle-wakeup reduction without
re-introducing the +40 ms CRR regression that forced the cadence
back to a fixed 5 ms.

The adaptive policy:
- last cycle had any kernel event   → next timeout 5 ms (active)
- last cycle timed out (no events) → next timeout 50 ms (idle)

A single quiet cycle drops us to idle; a single event puts us back
in active in the next cycle.

The subtlety that motivated the additional EpollDispatch change:
when the vCPU thread calls epoll_waker.wake() during a 50 ms idle
wait, the kernel's epoll_wait returns with the self-pipe event.
wait_with_timeout filters that event out and drains the pipe — so
`epoll_events.is_empty()` would have remained true, and the naive
"is_empty ⇒ idle" predicate kept us at 50 ms forever, regressing
CRR p50 back to ~50 ms.

wait_with_timeout now returns the *raw* kernel count (including
self-pipe wakes) so the adaptive policy treats wakes as activity.
Filtered events still arrive in the out parameter unchanged; only
the return value's meaning shifted from "observable count" to
"raw count," which all existing callers ignore.

voidbox-network-bench post-fix:
  g2h:        ~6680 Mbps  (vs 5 ms fixed: 6500; vs master: +254%)
  bulk-g2h:   ~5550 Mbps  (vs 5 ms fixed: 5400; vs master: +254%)
  rr p50:     1 us        (in 99-sample iteration; parity)
  crr p50:    ~10100 us   (parity preserved — adaptive correctly
                           holds 5 ms cadence during connection
                           bursts because each connection's wake()
                           keeps raw_kernel_events > 0)

Idle CPU dropped: profile-pre showed net_poll_thread on-CPU 4.93 %
of total at fixed 5 ms cadence (200 wakes/sec); adaptive should
drop to ~10x lower during idle stretches between iterations.
@dpsoft dpsoft changed the title Phase 6.4: epoll-driven RX dispatch (3.5× throughput vs main) Phase 6.4: epoll-driven RX dispatch with adaptive timeout (3.5× throughput vs main) May 4, 2026
@dpsoft dpsoft requested a review from Copilot May 4, 2026 11:56
@dpsoft dpsoft marked this pull request as ready for review May 4, 2026 11:56
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR (Phase 6.4) rewrites the VM’s host→guest RX polling path from a timer-driven loop to an epoll-driven dispatch with an adaptive timeout, aiming to reduce unnecessary wakeups while improving throughput/latency under load. It introduces a shared EpollDispatch owned by SlirpBackend, wires readiness events through net_poll_thread, and updates tests/benches/docs to match the new design.

Changes:

  • Added a Linux EpollDispatch (epoll fd + self-pipe waker) and integrated it into SlirpBackend for flow registration and readiness-driven relay.
  • Reworked net_poll_thread to block on epoll with adaptive 5ms/50ms timeouts and forward ready events into the backend via a per-backend queue.
  • Updated baseline tests, microbenches, and plan documentation to reflect the new event-driven polling architecture and snapshot/restore epoll rebuild expectations.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/network_baseline.rs Updates drain helper, retransmit semantics, and adds an epoll rebuild smoke test (feature-gated).
src/vmm/mod.rs Replaces fixed sleep polling with epoll-driven net poll thread and adaptive timeout logic.
src/network/slirp.rs Adds epoll dispatcher/waker, registers host FDs on flow creation, and drives relay via readiness events + event queue.
src/network/nat.rs Documentation-only wording updates.
src/network/mod.rs Exposes epoll hooks (epoll_arc, push_ready_events) on NetworkBackend (Linux-only).
src/network/epoll_dispatch.rs New epoll wrapper with register/unregister, wait_with_timeout, and self-pipe Waker.
src/devices/virtio_net.rs Plumbs epoll access/events forwarding from device to backend (Linux-only).
src/daemon.rs Removes phase wording from a comment.
src/bin/voidbox-startup-bench/main.rs Removes phase wording from output strings.
src/bin/voidbox-network-bench/main.rs Updates backpressure-related documentation strings; adds placeholder RX latency field.
docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.4.md Adds the detailed Phase 6.4 implementation plan.
benches/network.rs Updates comments and adds a bench for one-packet RX latency cost under epoll dispatch.
Comments suppressed due to low confidence (1)

src/network/slirp.rs:1907

  • relay_icmp_echo no longer performs a periodic idle-timeout sweep; it only checks ICMP_IDLE_TIMEOUT for flows that appear in the ready set. ICMP flows that go silent (no more readiness events) will never be reaped, leaving sockets registered in epoll indefinitely. Reintroduce a lightweight sweep over flow_table (similar to UDP’s stale scan) or track expirations separately so idle ICMP entries are eventually evicted even without further traffic.
    /// Drain replies from each active ICMP echo socket and emit echo-reply
    /// frames to the guest, driven by epoll readiness.
    ///
    /// Only flows whose token appears in `ready` with EPOLLIN set are visited.
    /// Entries idle longer than `ICMP_IDLE_TIMEOUT` are still evicted on any
    /// readiness event for that flow.
    fn relay_icmp_echo(&mut self, ready: &[EpollEvent]) {
        const ICMP_IDLE_TIMEOUT: Duration = Duration::from_secs(60);
        let now = Instant::now();

        let flow_keys: Vec<FlowKey> = ready
            .iter()
            .filter(|ev| ev.readable && ev.token & PROTO_TAG_MASK == PROTO_TAG_ICMP)
            .filter_map(|ev| {
                self.flow_table.keys().copied().find(|fk| {
                    if let FlowKey::IcmpEcho(icmp_key) = fk {
                        flow_token_for_icmp(icmp_key) == ev.token
                    } else {
                        false
                    }
                })
            })
            .collect();
        for flow_key in flow_keys {
            let FlowKey::IcmpEcho(key) = flow_key else {
                continue;
            };
            let frame = {
                let Some(FlowEntry::IcmpEcho(entry)) = self.flow_table.get_mut(&flow_key) else {
                    continue;
                };
                if now.duration_since(entry.last_activity) > ICMP_IDLE_TIMEOUT {
                    None // mark for removal below
                } else {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/network/slirp.rs Outdated
Comment thread src/network/slirp.rs Outdated
Comment thread src/network/slirp.rs
Comment on lines +1219 to +1227
if let Some(host_fd) = new_host_fd {
let token = flow_token_for_udp(&key);
self.epoll
.lock()
.unwrap()
.register(host_fd, token, true, false)
.ok();
self.epoll_waker.wake();
}
Comment thread src/network/slirp.rs
Comment on lines +1298 to +1306
if let Some(host_fd) = new_icmp_fd {
let token = flow_token_for_icmp(&key);
self.epoll
.lock()
.unwrap()
.register(host_fd, token, true, false)
.ok();
self.epoll_waker.wake();
}
Comment thread src/network/slirp.rs
Comment on lines +1725 to +1736
// Idle-timeout sweep: scan flow_table once without collecting a
// separate key Vec. 300-second inactivity applies regardless of epoll
// readiness; this is O(n) in the number of TCP flows but has no
// heap allocation overhead.
const TCP_IDLE_TIMEOUT: Duration = Duration::from_secs(300);
for (flow_key, entry) in &self.flow_table {
if let FlowEntry::Tcp(tcp_entry) = entry {
if tcp_entry.last_activity.elapsed() > TCP_IDLE_TIMEOUT
&& !to_remove.contains(flow_key)
{
to_remove.push(*flow_key);
}
Comment thread src/network/epoll_dispatch.rs Outdated
Comment thread src/vmm/mod.rs Outdated
Comment thread src/vmm/mod.rs Outdated
Comment thread src/network/slirp.rs
dpsoft added 5 commits May 4, 2026 14:40
flow_token_for_tcp/udp truncated dst_ip to 16 bits; flow_token_for_icmp
omitted dst_ip entirely. Multiple flows could collide on the same
token, mis-routing readiness events to the wrong FlowKey.

Replace the lossy encoding with a monotonic AtomicU64 counter per
backend. Tokens are still tagged in the high byte for protocol demux
(PROTO_TAG_TCP/UDP/ICMP); the lower 56 bits are unique. A new
token_to_key HashMap makes readiness → FlowKey lookup O(1) instead
of the previous linear flow_table scan.
net_poll_thread held the Mutex<EpollDispatch> across the blocking
epoll_wait call (up to 50 ms in idle cadence). vCPU register/
unregister paths in handle_tcp_frame (and friends) had to acquire
the same mutex and would block behind the wait, stalling guest TCP
SYN handling for up to 50 ms during connection setup.

epoll_ctl and epoll_wait are kernel-thread-safe on the same epoll
fd; the only state requiring synchronization was the self-pipe
(now eagerly initialized in EpollDispatch::new) and the registered
fd count (now AtomicUsize). EpollDispatch becomes Sync without an
external Mutex — the type changes from Arc<Mutex<EpollDispatch>>
to Arc<EpollDispatch>. register/unregister run lock-free against
the wait thread; only the kernel's per-epoll-fd internal lock
serializes, and that's a fast path.
to_remove.contains() inside the idle-timeout loop was O(n*k)
under churn. Switch the membership check to a HashSet<FlowKey>
and only materialize the Vec once at the end for the removal
loop.
Apply project rust-style rules to the recently-landed Phase 6.4
code (token rewrite + lock-free EpollDispatch refactor):

1. RegisterMode enum replaces (readable: bool, writable: bool) on
   EpollDispatch::register. Closed-set policy at the call site
   (Read / Write / ReadWrite) over two opaque booleans.

2. matches!() removed at three sites — the project guide prefers
   full match (or boolean ==) for compiler diagnostics if the
   matched type changes. The unprivileged-ICMP errno check now
   uses == comparisons; FlowKey::Tcp counter uses a for loop.

3. Iterator chains in relay loops rewritten as for loops with
   mutable accumulators per the project rule. relay_tcp_nat_data,
   relay_udp_flows, relay_icmp_echo all touched. Logic unchanged;
   control flow now reads top-down without a chain of
   .filter().filter_map().collect().

4. Local renamed `rc` → `epoll_ctl_result` at three EpollDispatch
   sites. Role-bearing names are required in non-tiny scopes.

5. Dropped redundant explanatory comments around the relay loops
   ("Data relay — only for flows with…", "Skip entries already
   queued for…", "Collect ready ICMP flow keys via…"). The code
   below them is self-describing. Kept structural "why" comments
   (the ICMP idle-sweep rationale, the per-flow socket Drop
   contract).

No behavior change. cargo fmt, clippy -D warnings, network_baseline
(18/18), lib network (23/23), and voidbox-network-bench wall-clock
(g2h ~6580 Mbps, CRR ~32 µs) all green.
@dpsoft dpsoft force-pushed the smoltcp-passt-port-phase6.4-epoll branch from dc1001b to dcbf18b Compare May 4, 2026 18:12
dpsoft added a commit that referenced this pull request May 4, 2026
Two artifacts from the rebase onto the new #69 tip:

1. bind_port_forward_listeners was called twice in with_security
   after the rebase — the second call shadowed the first, which
   was harmless but emitted an unused-variable warning under
   -D warnings. Drop the duplicate.

2. The bench-helpers helper insert_synthetic_lastack_entry was
   missing flow_token in its TcpNatEntry initializer (the field
   was added by the Copilot fix #1 monotonic-token rewrite that
   landed on #69 after the original phase 6.1 commit chain was
   written). Allocate via next_flow_token(PROTO_TAG_TCP).
Base automatically changed from smoltcp-passt-port-phase0 to main May 6, 2026 00:00
…ase6.4-epoll

# Conflicts:
#	benches/network.rs
#	src/bin/voidbox-network-bench/main.rs
#	src/network/mod.rs
#	src/network/nat.rs
#	src/network/slirp.rs
#	tests/network_baseline.rs
@dpsoft dpsoft merged commit 6a5d8b5 into main May 6, 2026
22 checks passed
@dpsoft dpsoft deleted the smoltcp-passt-port-phase6.4-epoll branch May 6, 2026 00:38
dpsoft added a commit that referenced this pull request May 6, 2026
…ing (#76)

* docs: Phase 6.1 detailed TDD plan — TCP half-close

9 bite-sized tasks covering the half-close state machine:
- Established → FinWait1 (guest closed first) → LastAck on host EOF
- Established → CloseWait (host closed first) → LastAck on guest FIN
- LastAck → Closed on guest's final ACK or LAST_ACK_TIMEOUT (60s)

Three new pins: tcp_half_close_guest_writes_first (BROKEN_ON_PURPOSE
flips at Task 3), tcp_half_close_host_writes_first (passes after
Task 3), tcp_last_ack_timeout_reaps_stale_entry (bench-helpers
gated, Task 7).

Severity: HIGH — current code drops host's pending response on
guest shutdown(SHUT_WR), causing silent data loss for HTTP, SMTP,
and any protocol with orderly half-close.

* test(network): pin tcp_half_close_guest_writes_first (BROKEN_ON_PURPOSE)

Guest sends FIN after data; current code marks state=Closed immediately
on guest FIN, so the host's response data is dropped. This pin locks
the broken behavior so the Phase 6.1 fix is legible to reviewers.

Flips to PASS when Tasks 2–3 implementation lands.

* feat(slirp): FinWait1 → LastAck and Established → CloseWait on host EOF

Tasks 2+3+4 implemented together: clippy's -D warnings requires our_fin_sent
to be used before committing the field addition, so the relay loop changes
from Task 3 and the ACK handler from Task 4 are bundled with Task 2's struct
changes.

Structural changes (TcpNatEntry):
- Add last_state_change: Instant — tracks when state last changed, used
  for LAST_ACK_TIMEOUT reaping (Task 6).
- Add our_fin_sent: bool — prevents re-sending FIN on repeated epoll
  readiness events for the same half-close transition.
- Initialize both fields at all three TcpNatEntry literal sites.

FIN handler (handle_tcp_frame):
- Established → FinWait1: ACK guest FIN, shutdown(Write) host socket
  so host sees EOF; stay alive for host's pending response data.
- CloseWait → LastAck: host already closed; guest just closed; ACK
  and wait for guest's final ACK of our FIN.

ACK handler (handle_tcp_frame):
- LastAck → Closed: guest's final ACK reaps the entry.
- Extend ACK-driven consume to FinWait1 state so bytes_in_flight
  drains and recv_peek eventually sees Ok(0) (host EOF).

Relay loop (relay_tcp_nat_data):
- Extend data relay to FinWait1 in addition to Established.
- Ok(0) arm now dispatches on state: Established → CloseWait,
  FinWait1 → LastAck (instead of immediate Closed).
- FIN-emit logic uses our_fin_sent guard and needs_fin predicate
  (CloseWait | LastAck) to send FIN exactly once per transition.

Test: update tcp_half_close_guest_writes_first to ACK received data
segments so the ACK-driven consume path drains the kernel buffer
and recv_peek sees EOF. Without ACKs the FinWait1 relay sees the
same bytes on every peek and never reaches Ok(0).

* test(network): pin tcp_half_close_host_writes_first

Symmetric mirror of tcp_half_close_guest_writes_first: host writes
GREETING, shuts down its write side (Established → CloseWait), and waits
for guest to reply; guest sends REPLY data + FIN.

Also fix two implementation gaps found while writing this pin:

1. Guest→host data forwarding in CloseWait: the payload forward guard
   was Established-only. CloseWait must also forward guest data (host
   closed its write side but can still read). Extend the predicate to
   include CloseWait.

2. shutdown(Write) on CloseWait → LastAck: when the guest sends FIN in
   CloseWait the host application is blocked on read_to_end. We must
   call shutdown(Write) on host_stream so the kernel delivers EOF to
   the host. Without this the host blocks forever.

* feat(slirp): LAST_ACK_TIMEOUT reaping prevents LastAck entry leak

Add LAST_ACK_TIMEOUT = 60 s (TCP MSL × 2) as a module-scope constant.

Merge the LastAck-timeout check into the existing idle-timeout sweep in
relay_tcp_nat_data, making a single O(n) pass handle both conditions:

- Standard idle-timeout (300 s of inactivity, any state): unchanged.
- LastAck-timeout (60 s since last_state_change, LastAck state): reaps
  entries where the guest's final ACK never arrived, e.g. after a guest
  crash. Logs at WARN so operators can connect the timeout to a prior
  half-close sequence without a separate debugging session.

The combined sweep avoids a second loop and an extra to_remove Vec
allocation. Using last_state_change (set on every state transition)
rather than last_activity (set on data relay) gives an accurate 60 s
window from the moment we sent our FIN.

* test(network): pin tcp_last_ack_timeout_reaps_stale_entry (bench-helpers)

Add a bench-helpers–gated pin that verifies the LAST_ACK_TIMEOUT reaper:

1. Insert a synthetic LastAck entry via the new bench-helpers helper
   insert_synthetic_lastack_entry (TcpNatState::LastAck, our_fin_sent=true).
2. Back-date last_state_change by 70 s (> LAST_ACK_TIMEOUT = 60 s) via
   set_synthetic_last_state_change.
3. One drain_to_guest cycle runs the timeout sweep.
4. Assert the entry is gone (tcp_flow_state returns None).

Two new bench-helpers methods on SlirpBackend:
- insert_synthetic_lastack_entry: seeds a LastAck flow without a
  full half-close exchange, for timeout reaping tests.
- set_synthetic_last_state_change: back-dates last_state_change on an
  existing entry to simulate an expired LAST_ACK_TIMEOUT.

Also widen TcpNatState and tcp_flow_state to pub so the bench-helpers–
gated external test can reference them by path. Both were previously
pub(crate); they are now observable as part of the bench-helpers surface.

* refactor(slirp): drop allow(dead_code) on TcpNatState; remove unused FinWait2

All TcpNatState variants are now wired into the state machine:
- SynReceived, SynSent: handshake paths (unchanged)
- Established: normal data transfer (unchanged)
- FinWait1: guest half-closed; relay continues forwarding host response
- CloseWait: host half-closed; relay continues forwarding guest data
- LastAck: both sides closed; waiting for guest's final ACK or timeout
- Closed: entry pending removal

Remove FinWait2: distinguishing it from FinWait1 would require
observing per-segment ACKs from the kernel — the relay does not track
those. The relay stays in FinWait1 until host EOF, then jumps to LastAck.
If the distinction is needed later, the variant can be re-added.

Update the TcpNatState doc comment to document the state machine diagram
and the FinWait2 omission rationale, replacing the old per-variant stubs.

* fix(slirp): drop duplicate bind + init flow_token in lastack helper

Two artifacts from the rebase onto the new #69 tip:

1. bind_port_forward_listeners was called twice in with_security
   after the rebase — the second call shadowed the first, which
   was harmless but emitted an unused-variable warning under
   -D warnings. Drop the duplicate.

2. The bench-helpers helper insert_synthetic_lastack_entry was
   missing flow_token in its TcpNatEntry initializer (the field
   was added by the Copilot fix #1 monotonic-token rewrite that
   landed on #69 after the original phase 6.1 commit chain was
   written). Allocate via next_flow_token(PROTO_TAG_TCP).

* fix(slirp): doc-link to bench-helpers method via Self::

A bare intra-doc-link (`[`set_synthetic_last_state_change`]`) doesn't
resolve to a struct method even when both items live in the same
`impl` block — rustdoc looks up bare names in the module scope, not
the enclosing impl. CI's `-D warnings` then elevates the
`broken_intra_doc_links` warning to an error and fails the
Documentation job. Switch to `Self::set_synthetic_last_state_change`
so rustdoc resolves through the impl.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants